learning rate
Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates
Takemura, Kei, Matsuno, Ryuta, Sakuma, Keita
Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., $O(1)$). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to $\Theta(T)$, where $T$ is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains $O(\log T)$, allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.
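The link between learning-rate size and adaptation lag can be illustrated with a toy sketch. This is not the optimistic online mirror descent or the penalty mechanism proposed in the abstract; it swaps in a plain exponential-weights update with fixed-share mixing over two experts. The function name `adaptation_lag`, the 0/1 losses, the switch round, and the `share` parameter are all assumptions made for illustration.

```python
import math

def adaptation_lag(eta, switch=50, horizon=300, share=0.01):
    """Rounds after `switch` until the newly best expert holds majority weight.

    Toy setting (assumed): two experts, expert 0 is best before `switch`,
    expert 1 is best afterwards.
    """
    weights = [0.5, 0.5]
    for t in range(horizon):
        best = 0 if t < switch else 1
        if t >= switch and weights[1] > 0.5:
            return t - switch
        # Loss 0 for the currently best expert, 1 for the other.
        losses = [0.0 if i == best else 1.0 for i in range(2)]
        # Exponential-weights update with learning rate eta.
        scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(scaled)
        # Fixed-share mixing keeps every weight at or above share / 2.
        weights = [(1 - share) * s / total + share / 2 for s in scaled]
    return None  # the new expert never gained majority weight

print(adaptation_lag(0.1), adaptation_lag(5.0))
```

In this toy setting the larger learning rate recovers from the switch in fewer rounds. The sketch says nothing about the stability cost of large learning rates, which is the problem the safeguard in the abstract addresses.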
Anytime Training with Schedule-Free Spectral Optimization
Apte, Anuj, Deshpande, Pranav, Kumar, Niraj, Chakrabarti, Shouvanik, Kim, Junhyung Lyle
Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(\mu\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<\mu<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.
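A minimal sketch of a two-factor product map may make the setting concrete. The abstract does not state the map or its normalization, so the code assumes the simplest candidate: gradient descent on the loss (a*b - 1)^2 / 2 with step size `mu`. The names `gd_step` and `run` are hypothetical, and the thresholds of this toy need not match the paper's.

```python
def gd_step(a, b, mu):
    """One gradient-descent step on the assumed loss 0.5 * (a*b - 1)**2."""
    residual = a * b - 1.0
    # Both factors are updated from the same (old) values of a and b.
    return a - mu * residual * b, b - mu * residual * a

def run(a, b, mu, steps):
    """Iterate the two-factor map for a fixed number of steps."""
    for _ in range(steps):
        a, b = gd_step(a, b, mu)
    return a, b

# Small step size: the product a*b approaches the target value 1.
a, b = run(0.5, 0.8, mu=0.1, steps=500)
print(a * b)

# Balanced start (a == b): the map stays on the balanced slice, where it
# reduces to the scalar cubic map a -> a - mu * (a**2 - 1) * a.
a, b = run(0.5, 0.5, mu=0.1, steps=10)
print(a == b)

# Very large step size: the iterates blow up within a few steps.
a, b = run(2.0, 2.0, mu=5.0, steps=5)
print(abs(a))
```

Sweeping `mu` between these two extremes is one way to look for the intermediate regimes the abstract describes. The sketch only demonstrates the convergent and divergent ends.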
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
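For readers unfamiliar with the base method, here is a minimal sketch of schedule-free SGD in one dimension, following the commonly published three-sequence form (gradient point `y`, base iterate `z`, averaged iterate `x`). It is not ScheduleFree+ and contains none of the scaling fixes mentioned above. The function name, the quadratic test objective, and the values of `lr` and `beta` are assumptions for illustration.

```python
def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, steps=2000):
    """Schedule-free SGD sketch: constant step size, no decay schedule."""
    z = x0  # base SGD iterate
    x = x0  # running average, the point used for evaluation
    for t in range(1, steps + 1):
        # Gradients are taken at an interpolation between z and x.
        y = (1.0 - beta) * z + beta * x
        z = z - lr * grad(y)
        # Uniform averaging weight; no training horizon is needed.
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z
    return x

# Assumed toy objective f(x) = (x - 3)**2 with gradient 2 * (x - 3).
x_final = schedule_free_sgd(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)
```

Because `x` is a running average that can be read at any step, stopping early still yields a usable iterate, which is the "anytime" property the abstract refers to.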
Supplementary Material
All code can be downloaded from https://github.com/Shanka123/OCRA. Figure S1: Abstract Reasoning Tasks (ART). Same/different: Two objects are presented, and the task is to say whether they are the same or different. Relational match-to-sample: A source pair of objects is presented that instantiates either a 'same' or 'different' relation, and the task is to select the pair of target objects (out of two pairs) that instantiates the same relation. [...] The task is to select the missing object from a set of four choices. Problems were presented in a 2 × 2 array format, with the source pair presented in the top [...]. Problems were presented in a 2 × 3 array format, with one of the answer choices inserted into the bottom right cell (separate images for each answer choice, see Figure S8). Identity rules: An abstract pattern is instantiated in the first row (ABA, ABB, or AAA), and [...] instantiated in the second row.
Operator Learning with Neural Fields: Tackling PDEs on General Geometries Supplemental Material
A.1 Initial Value Problem. We use the datasets from Pfaff et al. (2021), and take the first and last frames of each trajectory as the input and output data for the initial value problem. Cylinder: The dataset includes computational fluid dynamics (CFD) simulations of the flow around a cylinder, governed by the incompressible Navier-Stokes equation. These simulations were generated using COMSOL software, employing an irregular 2D-triangular mesh. The trajectory consists of 600 timestamps, with a time interval of t = 0.01 s between each timestamp. Airfoil: The dataset contains CFD simulations of the flow around an airfoil, following the compressible Navier-Stokes equation. These simulations were conducted using SU2 software, using an irregular 2D-triangular mesh. The trajectory encompasses 600 timestamps, with a time interval of t = 0.008 s between each timestamp. A.2 Dynamics Modeling. 2D-Navier-Stokes (Navier-Stokes): We consider the 2D Navier-Stokes equation as presented in Li et al. (2021); Yin et al. (2022).
Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training
Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies essentially define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing.
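One way a layer-wise learning rate method of this kind could be wired up is sketched below. The abstract does not specify the metric or the mapping, so everything here is an assumption: each layer is taken to have a scalar HT-SR-style metric (called `alpha`), and the metrics are mapped linearly onto a learning-rate range around a base rate, with larger `alpha` receiving a larger rate. The function name, the scale bounds, and the example values are hypothetical.

```python
def layerwise_learning_rates(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer metrics linearly onto [s_min, s_max] * base_lr.

    `alphas` is a dict from layer name to a scalar metric (assumed given).
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    if hi == lo:
        # All layers look alike: fall back to the global learning rate.
        return {name: base_lr for name in alphas}
    rates = {}
    for name, alpha in alphas.items():
        position = (alpha - lo) / (hi - lo)  # 0 for smallest, 1 for largest
        rates[name] = base_lr * (s_min + (s_max - s_min) * position)
    return rates

# Hypothetical metric values for three layers.
example = {"layer1": 2.0, "layer2": 4.0, "layer3": 6.0}
print(layerwise_learning_rates(example, base_lr=0.1))
```

In a training loop, the resulting dictionary could be recomputed periodically and written into per-layer optimizer parameter groups. How the metric is estimated from the layer weights is left out of this sketch.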